Abstract

The measurement error with normal distribution is universal in applications. 
Generally, smaller measurement error requires better instrument and higher test 
cost. In decision making based on attribute values of objects, we shall select an
attribute subset with appropriate measurement error to minimize the total test 
cost. Recently, error-range-based covering rough set with uniform distribution 
error was proposed to investigate this issue. However, in most applications the
measurement errors follow a normal distribution rather than the simpler uniform
distribution. In this paper, we introduce normal distribution measurement errors
to the covering-based rough set model, and deal with the test-cost-sensitive
attribute reduction problem in this new model. The major contributions of
this paper are four-fold. First, we build a new data model based on normal 
distribution measurement errors. With the new data model, the error range is 
an ellipse in a two-dimension space. Second, the covering-based rough set with 
normal distribution measurement errors is constructed through the "3-sigma" 
rule. Third, the test-cost-sensitive attribute reduction problem is redefined on 
this covering-based rough set. Fourth, a heuristic algorithm is proposed to deal 
with this problem. The algorithm is tested on ten UCI (University of California 
- Irvine) datasets. The experimental results show that the algorithm is more 
effective and efficient than the existing one. This study is a step toward realistic 
applications of cost-sensitive learning. 

Keywords: Normal distribution, measurement errors, test costs, 
covering-based rough set. 



1. Introduction 

The measurement error is the difference between a measurement value and 
its true value. It can come from the measuring instrument, from the item being 
measured, from the environment, from the operator, and from other sources 
[1]. As a plausible distribution for measurement errors, the normal distribution
was put forward by Gauss in 1809. In fact, normal distribution is found to be 
applicable over almost the whole of science and engineering measurement. In 
data mining applications, the data model based on measurement errors is an 
important form of uncertain data (see, e.g., 

Test costs refer to time, money, or other resources spent in obtaining data 
items related to some object El [3 |Hl IB UHl • There are a number of measure- 
ment methods with different test costs to obtain a data item. Generally, higher 
test cost is required to obtain data with smaller measurement error. In data 
mining applications, we shall select an attribute subset with appropriate mea- 
surement error to minimize the total test cost, and at the same time preserve 
necessary information of the original decision system. 

An attribute reduct is a subset of attributes that are jointly sufficient and 
individually necessary for preserving a particular property of the given infor- 
mation table [TT]. It is a key problem of rough set theory and has attracted 
much attention in recent years (see, e.g., [HI |T31 [TH [1^1 [H] ) . As a generaliza- 
tion of attribute reduction, test-cost-sensitive attribute reduction [3] focuses on 
selecting a set of tests to satisfy a minimal test cost criterion. 

Recently, error-range-based covering rough set was introduced to address 
error ranges. This theory is based on both covering-based rough set pniTBlfTOl 
[201 1211 121 123] and neighborhood rough set [21 [23 1211 123 [2H] • Moreover, in 
the new theory, the test-cost-sensitive attribute reduction problem deals with 
numeric data instead of nominal ones. Therefore the problem is more challenging 
than the one defined in [9]. However, error-range-based covering rough set 
considers only uniform distribution errors, which are rather unrealistic. 

In this paper, we introduce the normal distribution to build a new model of
covering-based rough set that addresses normal distribution measurement errors
(NDME) according to the "3-sigma" rule. The major contributions of this paper are four-fold.
First, we introduce normal distribution to build a new data model based on 
measurement errors. With the new data model, the error range is an ellipse in 
a two-dimension space. The error range is computed according to the values 
of attributes instead of the fixed error range for different datasets. Second, 
we build the computational model, namely the covering-based rough set with 
normal distribution measurement errors. Third, the minimal test cost attribute 
reduction problem is redefined in the new model. Fourth, we propose a heuristic 
algorithm to address the reduction problem. Specifically, a $\delta$-weighted heuristic
reduction algorithm is designed, where attribute significance is adjusted by the
$\delta$-weighted test cost.

Ten open datasets from the UCI library are employed to study the perfor- 
mance and effectiveness of our algorithm. We adopt three metrics to evaluate 
the performance of the reduction algorithms from a statistical viewpoint.
Experiments undertaken with the open source software Coser [29] validate the
performance of this algorithm. Experimental results show that our algorithm can
generate a minimal test cost reduct in most cases. At the same time, the
proposed algorithm can achieve better performance and efficiency than the existing
one [4].

The rest of the paper is organized as follows. Section 2 presents the data
models with measurement errors and test costs, respectively. Section 3 describes
the computational model, namely the covering-based rough set model with normal
distribution measurement errors. The minimal test cost reduct problem under
the new model is also defined in this section. Next, Section 4 presents a
$\delta$-weighted heuristic reduction algorithm and a competition approach.
Experimental results and a comparison with the existing work are discussed in Section 5.
Finally, conclusions are drawn in Section 6.



2. Data models 



This section presents data models. First, we propose a decision system with 
normal distribution measurement errors, which is also called NEDS for brevity. 
Then, we introduce test costs to NEDS, and define test-cost-sensitive decision 
systems with NDME. 

2.1. Normal distribution measurement errors 

Normal distribution is symmetrical with a single central peak at the mean 
of the data [30] . It is described by the probability density function 

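$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}, \qquad (1)$$

where the parameter $\mu$ is the mean and $\sigma^2$ is the variance.

The cumulative distribution function $F(x)$ describes the probability of a random
variable falling in the interval $(-\infty, x]$:

$$F(x) = \int_{-\infty}^{x} f(t)\,dt, \qquad (2)$$

where $x \in \mathbb{R}$.

For a random variable $X$,

$$\Pr(X \le x) = F(x). \qquad (3)$$

The standard normal distribution is obtained with $\mu = 0$ and $\sigma^2 = 1$. The
equation becomes

$$f(x) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{x^2}{2}}. \qquad (4)$$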

As shown in Figure 1, if the data obey a normal distribution, over 99% of
the observations fall within three standard deviations of the mean. We use the
following example to explain the relationship between standard deviation and 
confidence interval. 



Example 1. Let the standard deviation be 0.01 and the mean be 0. Then
about 99.7% of the measurement errors fall between -0.03 and +0.03.
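
The percentages quoted by the "3-sigma" rule can be checked numerically. The following minimal sketch is not part of the original experiments; it evaluates the cumulative distribution function of Equation (2) through the error function, and the helper names `normal_cdf` and `coverage` are hypothetical.

```python
import math

def normal_cdf(x, mu=0.0, sigma=1.0):
    """Cumulative distribution function F(x) of a normal distribution N(mu, sigma^2)."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def coverage(k, mu=0.0, sigma=1.0):
    """Probability that a value lies within k standard deviations of the mean."""
    return normal_cdf(mu + k * sigma, mu, sigma) - normal_cdf(mu - k * sigma, mu, sigma)

# Setting of Example 1: mean 0 and standard deviation 0.01.
for k in (1, 2, 3):
    print(f"within {k} sigma (+/-{k * 0.01:.2f}): {coverage(k, 0.0, 0.01):.2%}")
```

The coverage depends only on the number of standard deviations, not on the particular values of the mean and the standard deviation.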

Figure 1: Confidence interval.



2.2. Decision systems with normal distribution measurement errors 

We introduce normal distribution measurement errors into our model to 
expand the application scope. For a normal distribution, nearly all values lie 
within 3 standard deviations of the mean; this is the "3-sigma" rule [30].

Definition 2. A decision system with normal distribution measurement errors
(NEDS) $S$ is the 6-tuple:

$$S = (U, C, D, V = \{V_a \mid a \in C \cup D\}, I = \{I_a \mid a \in C \cup D\}, n), \qquad (5)$$

where $U$ is the nonempty set called a universe, and $C$ and $D$ are the nonempty sets
of variables called conditional attributes and decision attributes, respectively.
$V_a$ is the set of values for each $a \in C \cup D$, and $I_a : U \to V_a$ is an information
function for each $a \in C \cup D$. We often denote $\{V_a \mid a \in C \cup D\}$ and
$\{I_a \mid a \in C \cup D\}$ by $V$ and $I$, respectively. $n : C \to \mathbb{R}^{+} \cup \{0\}$ is the maximum value of
measurement error. $+n(a)$ and $-n(a)$ are the upper confidence limit (UCL) and
the lower confidence limit (LCL) of $a \in C$, respectively.

Definition 3. Let $S = (U, C, D, V, I, n)$ be a NEDS. The error range of attribute
$a$ is defined as

$$n(a) = \Delta e(a), \qquad (6)$$

where

$$e(a) = \Delta \frac{\sum_{i=1}^{m} a(x_i)}{m}, \qquad (7)$$

where $\Delta \in [0, 1]$ is a user-specified parameter, $a(x_i)$ is the value of the $i$-th
instance on $a \in C$, $i \in [1, m]$, and $m$ is the number of instances. The precision of
$e(a)$ can be adjusted through the $\Delta$ setting.

Obviously, if $\Delta = 0$, a NEDS degrades to a decision system (DS). If $\Delta = 1$
and $n(a)$ is a fixed value, a NEDS degrades to a decision system with error range
(DS-ER) (see, e.g., [5]). Therefore NEDS is a generalization of DS and DS-ER.

We now describe how to deal with abnormal values of measurement error.
In applications, if the repeated measurement data satisfy

$$|x_i - \bar{x}| > 3\sigma, \quad (i = 1, 2, \ldots, N), \qquad (8)$$

then $x_i$ is considered an abnormal value and is rejected, where $x_i$ is
the $i$-th measurement value and $\bar{x}$ is the mean of all measurement values. This
is the Pauta criterion of measurement error theory.
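
The Pauta criterion of Equation (8) can be sketched in a few lines. This is an illustration under stated assumptions rather than the procedure used in the experiments: the function name `pauta_filter` is hypothetical, $\sigma$ is estimated by the sample standard deviation, and only a single rejection pass is performed.

```python
import statistics

def pauta_filter(values):
    """One pass of the Pauta (3-sigma) criterion.

    Returns (kept, rejected), where rejected holds every x_i with
    |x_i - mean| > 3 * sigma. sigma is the sample standard deviation,
    which is an assumption of this sketch.
    """
    mean = statistics.fmean(values)
    sigma = statistics.stdev(values)
    kept = [x for x in values if abs(x - mean) <= 3 * sigma]
    rejected = [x for x in values if abs(x - mean) > 3 * sigma]
    return kept, rejected

# Nineteen repeated measurements close to 10.0 plus one gross error.
measurements = [10.0 + 0.01 * ((i % 3) - 1) for i in range(19)] + [12.0]
kept, rejected = pauta_filter(measurements)
print(len(kept), rejected)
```

Note that with a sample standard deviation the criterion can only reject a value when there are at least eleven measurements, because the largest attainable deviation is $(N-1)/\sqrt{N}$ standard deviations.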

Now, we investigate the relationship between the limit of confidence interval 
and the standard deviation in the following proposition. 

Proposition 4. Let $-n(a)$ and $+n(a)$ be the LCL and UCL, respectively, and $Pr$
be the confidence level. The upper limit of the confidence interval is

$$n(a) = 3\sigma, \qquad (9)$$

where $Pr = 99.73\%$.



A value that exceeds the confidence interval at the 99.73% confidence level is
an abnormal error, which needs to be identified and removed from consideration.
The standard normal distribution is a special case of the normal distribution. 
The limit of confidence interval is investigated in the following proposition. 

Proposition 5. Let $-n(a)$ and $+n(a)$ be the LCL and UCL of standard normal
distribution measurement errors, respectively. We have

$$n(a) = 3. \qquad (10)$$

Proof. The standard normal distribution is given by taking $\mu = 0$ and
$\sigma^2 = 1$ in a general normal distribution. Since $n(a) = 3\sigma$ and $n(a) > 0$,
Equation (10) holds.



The adjusting factor $\Delta$ plays a key role in Definition 3. Its effect
is described by the following proposition.

Proposition 6. Let $-n(a)$ and $+n(a)$ be the LCL and UCL of $a \in C$, respectively.
Confidence intervals are stated at the $Pr$ confidence level, and $n(a) = 3\sigma$.
According to Equation (3), we have

$$\Pr(-n(a) < x < n(a)) = F(n(a)) - (1 - F(n(a))). \qquad (11)$$

According to Equation (6) and Proposition 6, if $2/3 < \Delta \le 1$, we have
$2\sigma < n(a) \le 3\sigma$ and $95.45\% < Pr \le 99.73\%$; if $1/3 < \Delta \le 2/3$, we have
$\sigma < n(a) \le 2\sigma$ and $68.27\% < Pr \le 95.45\%$; and if $0 < \Delta \le 1/3$, we have
$0 < n(a) \le \sigma$ and $0\% < Pr \le 68.27\%$.

One can adjust the size of the neighborhood through the $\Delta$ setting to meet
different requirements.
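
The correspondence between the adjusting factor and the confidence level can be reproduced with Equation (11). The sketch below assumes, as in the text, that an adjusting factor of 1 corresponds to $n(a) = 3\sigma$ and that the errors have zero mean; the function name `confidence_level` is hypothetical.

```python
import math

def confidence_level(adjusting_factor):
    """Pr(-n(a) < x < n(a)) when n(a) = adjusting_factor * 3 * sigma.

    Assumes zero-mean normal errors, so the result does not depend on sigma.
    """
    k = 3.0 * adjusting_factor                       # n(a) in standard deviations
    F = 0.5 * (1.0 + math.erf(k / math.sqrt(2.0)))   # F(n(a)) of the standardized error
    return F - (1.0 - F)                             # Equation (11)

for factor in (1 / 3, 2 / 3, 1.0):
    print(f"adjusting factor {factor:.3f}: Pr = {confidence_level(factor):.2%}")
```

The three printed values are the boundaries of the intervals listed above, so any intermediate setting falls strictly between two of them.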

2.3. Test-cost-independent decision system with normal distribution measurement errors

We introduce test costs to the data model. Now, we discuss the new model 
as follows: 

Definition 7. A test-cost-independent decision system with normal distribution
measurement errors (TCI-NEDS) $S$ is the 7-tuple:

$$S = (U, C, D, V, I, n, c), \qquad (12)$$

where $U$, $C$, $D$, $V$, $I$ and $n$ have the same meanings as in a NEDS, and
$c : C \to \mathbb{R}^{+} \cup \{0\}$ is the test cost function. Test costs are independent of one another,
that is, $c(B) = \sum_{a \in B} c(a)$ for any $B \subseteq C$.

Note that in this model, test costs are not applicable to decision attributes. 

To facilitate processing and comparison, the values of conditional attributes
are normalized into the range from 0 to 1:

$$y = \begin{cases} (x - min) / (max - min) & \text{if } max \neq min; \\ 0.5 & \text{otherwise,} \end{cases} \qquad (13)$$

where $x$ is the initial value, $y$ is the normalized value, and $max$ and $min$ are
the maximal and minimal values of the attribute domain, respectively.
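
The normalization rule, including the constant-attribute case, can be written as a short function. This is a minimal sketch; the name `normalize` and the list-based column layout are assumptions of the example, and the maximal and minimal values are taken from the observed column.

```python
def normalize(column):
    """Min-max normalize one attribute column into [0, 1].

    If all values coincide (max == min), every value is mapped to 0.5.
    """
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.5 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]

print(normalize([2.0, 4.0, 6.0]))   # [0.0, 0.5, 1.0]
print(normalize([3.0, 3.0]))        # [0.5, 0.5]
```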

Table 1 presents a decision system of Iris whose conditional attributes are
normalized values, where $C = \{SL, SW, PL, PW\}$, $D = \{Class\}$, and
$U = \{x_1, x_2, \ldots, x_{150}\}$.



Table 1: An example numerical value attribute decision table.

Patient   SL        SW        PL        PW        Class
x_1       0.23529   0.77273   0.14286   0.04762   setosa
x_2       0.29412   0.72727   0.11905   0.04762   setosa
x_3       0.35294   0.09091   0.38095   0.42857   versicolor
x_4       0.64706   0.31818   0.52381   0.52381   versicolor
x_5       0.41176   0.31818   0.50000   0.42857   versicolor
x_149     0.58824   0.54545   0.85714   1.00000   virginica
x_150     0.44118   0.27273   0.64286   0.71429   virginica



3. Covering-based rough set with normal distribution measurement 
errors 

Rough set theory is a powerful tool for dealing with uncertain knowledge
in information systems [31]. It has been successfully applied to feature
selection [32, 33], rule extraction [34, 35, 36], uncertainty reasoning [37],
decision evaluation [38, 39, 40], granular computing [41, 42, 43, 44], etc. Recently,
covering-based rough set has attracted much research interest with significant
achievements in both theory and application.

Figure 2: Conventional neighborhoods.

The concept of neighborhood (see, e.g., [23 EH HZ]) has been applied to
define different types of covering-based rough set [13l [191 ES] . From
different viewpoints, a neighborhood is called an information granule, or a covering
element. Figure 2 illustrates the neighborhoods of $x$ in a two-dimensional real
space [IS]. For this neighborhood rough set model, $\delta$ is a distance parameter
and objects with a distance no further than $\delta$ are viewed as neighbors. In this
approach, $\delta$ is a user-specified parameter. A new type of neighborhood is
defined in [4], and Figure 3 illustrates this two-dimensional neighborhood. The
size of the neighborhood depends on error ranges of tests, and more objects fall
into the neighborhood of $x_i$ for wider error ranges.





Figure 3: Two-dimensional neighborhood with error ranges. Along each attribute $a_i$, the block spans from $a_i(x) - 2e(a_i)$ to $a_i(x) + 2e(a_i)$.

In this section, we introduce normal distribution measurement errors to 
covering-based rough set. The new model is called covering-based rough set 
with normal distribution measurement errors. If all attributes are error free, 
the data in a neighborhood are equivalent to each other. In this case, the 
covering-based rough set model degenerates to the classical one. Therefore, the 
covering-based rough set with NDME is a natural extension of classical rough
set. 



3.1. Covering-based rough set with normal distribution measurement errors 

According to the "3-sigma" rule, we present a new model considering both error
distribution and confidence interval. With Definition 2, a new neighborhood is
defined as follows.

Definition 8. Let $S = (U, C, D, V, I, n)$ be a NEDS. Given $x_i \in U$ and $B \subseteq C$,
the neighborhood of $x_i$ with respect to normal distribution measurement errors
on test set $B$ is defined as

$$n_B(x_i) = \{x \in U \mid \forall a \in B, |a(x) - a(x_i)| \le 2n(a)\}, \qquad (14)$$

where $n(a) = \Delta e(a)$ is the error range based on the confidence level of $a$. It
represents the error value of $a$ in $[-n(a), +n(a)]$.

Measured values that differ by no more than $2n(a)$ should be viewed
as belonging to the same family of neighborhood granules. We explain why $n(a)$ instead of $e(a)$
was employed in Equation (14) as the maximal distance. Although the value
of the error is within a certain range, there are significant differences among
confidence intervals. As mentioned earlier, the "3-sigma" rule states that, for a normal
distribution, different proportions of values lie within different numbers of standard deviations
of the mean. In particular, the proportion is very close to 0 for data more than
three standard deviations from the mean. Therefore, measurement errors with
no more than a difference of $n(a) = \Delta e(a)$ should be viewed as the family of
neighborhood granules.
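
Equation (14) translates directly into a membership test applied attribute by attribute. The sketch below is an illustration only: objects are stored as dictionaries from attribute names to normalized values, the error ranges $n(a)$ are passed in as a dictionary, and the names `neighborhood`, `data` and `n` are assumptions of the example.

```python
def neighborhood(data, i, attrs, n):
    """Indices of all objects x with |a(x) - a(x_i)| <= 2 * n[a] for every a in attrs."""
    xi = data[i]
    return {j for j, x in enumerate(data)
            if all(abs(x[a] - xi[a]) <= 2 * n[a] for a in attrs)}

# Three toy objects described by two attributes; n(a) = 0.01 for both.
data = [
    {"a1": 0.10, "a2": 0.50},
    {"a1": 0.11, "a2": 0.51},
    {"a1": 0.30, "a2": 0.50},
]
n = {"a1": 0.01, "a2": 0.01}

print(neighborhood(data, 0, ["a1", "a2"], n))  # objects 0 and 1
print(neighborhood(data, 0, ["a2"], n))        # all three objects
```

Dropping an attribute from the test set can only enlarge a neighborhood, which is why the second call returns a superset of the first.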

Sometimes we have a number of tests to obtain the same data item. Suppose 
some error ranges are known and others are unknown. The following proposition 
provides an estimation. 

Proposition 9. Let $a_i$ and $a_j$ be the measurement values for the same data
item, with $|a_j(x) - a_i(x)| \le n'$ for any $x \in U$. We have

$$e(a_j) \le e(a_i) + n' / \Delta. \qquad (15)$$

Proof. Let the true value of $x \in U$ be $a^*(x)$ for $a \in B$. Due to the measurement
error, $a^*(x) - \Delta e(a_i) \le a_i(x) \le a^*(x) + \Delta e(a_i)$. Hence
$a_j(x) \le a_i(x) + n' \le a^*(x) + (\Delta e(a_i) + n')$ and
$a_j(x) \ge a_i(x) - n' \ge a^*(x) - (\Delta e(a_i) + n')$. Therefore
$e(a_j) \le e(a_i) + n' / \Delta$.

The shape of the neighborhoods is an ellipse for two-dimensional space.
The two-dimensional block is depicted in Figure 4. Naturally, the size of the
neighborhood depends on error ranges of tests and the adjusting factor. Figure 5
shows the different sizes of neighborhood based on different adjusting factors.

Figure 5: Two-dimensional neighborhood with NDME based on different adjusting factors.

Now we discuss some fundamental issues of rough set in the new model.

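Definition 10. Let $S = (U, C, D, V, I, n)$ be a NEDS, and $N_B$ be a neighborhood
relation induced by $B \subseteq C$. We call $\langle U, N_B \rangle$ a neighborhood
approximation space. For any $X \subseteq U$, two subsets of objects, called lower and upper
approximations of $X$ in $\langle U, N_B \rangle$, are defined as

$$\underline{N_B}(X) = \{x_i \mid x_i \in U \wedge n_B(x_i) \subseteq X\}; \qquad (16)$$

$$\overline{N_B}(X) = \{x_i \mid x_i \in U \wedge n_B(x_i) \cap X \neq \emptyset\}. \qquad (17)$$

$\forall X \subseteq U$, $\overline{N_B}(X) \supseteq X \supseteq \underline{N_B}(X)$. The boundary region of $X$ in the
approximation space is defined as

$$BN_B(X) = \overline{N_B}(X) - \underline{N_B}(X). \qquad (18)$$

The positive region of $D$ with respect to $B \subseteq C$ is defined as
$POS_B(D) = \bigcup_{X \in U/D} \underline{N_B}(X)$.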
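
The approximations and the positive region can be computed directly from their definitions. The following sketch restricts itself to a single attribute to stay short; `values`, `radius` (playing the role of $n(a)$) and the function names are hypothetical, and the toy data are chosen so that no distance falls exactly on the $2n(a)$ boundary.

```python
def neighborhood(values, i, radius):
    """n_B(x_i) for a single attribute: objects within 2 * radius of x_i."""
    return {j for j, v in enumerate(values) if abs(v - values[i]) <= 2 * radius}

def lower_upper(values, radius, X):
    """Lower and upper approximations of the object set X (Equations (16), (17))."""
    universe = range(len(values))
    lower = {i for i in universe if neighborhood(values, i, radius) <= X}
    upper = {i for i in universe if neighborhood(values, i, radius) & X}
    return lower, upper

def positive_region(values, radius, decisions):
    """Union of the lower approximations of all decision classes."""
    classes = {}
    for i, d in enumerate(decisions):
        classes.setdefault(d, set()).add(i)
    pos = set()
    for X in classes.values():
        pos |= lower_upper(values, radius, X)[0]
    return pos

values = [0.10, 0.11, 0.125, 0.50]
decisions = ["p", "p", "q", "q"]
lower, upper = lower_upper(values, 0.01, {0, 1})
print(lower, upper, upper - lower)               # lower, upper, boundary region
print(positive_region(values, 0.01, decisions))  # objects 0 and 3
```

Object 1 has a neighbor of the other class (object 2), so it belongs to the boundary of the class "p" rather than to its lower approximation.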

3.2. Test-cost-sensitive attribute reduct problem 

Attribute reduction is a successful technique to remove redundant data and 
facilitate the mining task. A number of definitions of relative reducts exist 
[571 [501 [51] for different rough set models. In this section, we define test- 
cost-sensitive attribute reduction on the covering-based rough set model with 
NDME. 

Definition 11. Let $S = (U, C, D, V, I, n)$ be a NEDS, $B \subseteq C$ and $x \in U$. Any
$y \in n_B(x)$ is called an inconsistent object in $n_B(x)$ if $D(y) \neq D(x)$. The set of
inconsistent objects in $n_B(x)$ is

$$ic_B(x) = \{y \in n_B(x) \mid D(y) \neq D(x)\}. \qquad (19)$$

The number of inconsistent objects, namely $|ic_B(x)|$, is important in evaluating
the characteristics of the neighborhood block. It also influences the quality of
the rule induced by the block.
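
Equation (19) can be evaluated by filtering a neighborhood on the decision value. The sketch below again uses a single attribute and hypothetical names (`inconsistent`, `values`, `decisions`, `radius`); it is meant only to make the definition concrete.

```python
def inconsistent(values, decisions, i, radius):
    """ic_B(x_i): neighbors of x_i (within 2 * radius) whose decision differs from D(x_i)."""
    return {j for j, v in enumerate(values)
            if abs(v - values[i]) <= 2 * radius and decisions[j] != decisions[i]}

values = [0.10, 0.11, 0.125, 0.50]
decisions = ["p", "p", "q", "q"]

for i in range(len(values)):
    print(i, inconsistent(values, decisions, i, 0.01))
```

Objects with an empty set of inconsistent objects are exactly those whose neighborhood lies inside their own decision class.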



From Definition 11 we know that, given $B \subseteq C$, $x \in POS_B(D)$ if and only
if $ic_B(x) = \emptyset$. Consequently, we have the following proposition, which can be
employed as an alternative definition of a reduct.

Proposition 12. Let $S = (U, C, D, V, I, n)$ be a NEDS. Any $R \subseteq C$ is a
decision-relative reduct iff:

1. $\forall x \in POS_C(D)$, $ic_R(x) = \emptyset$, and

2. $\forall a \in R$, $\exists x \in POS_C(D)$, s.t. $ic_{R - \{a\}}(x) \neq \emptyset$.

This proposition will help us in designing the reduction algorithm, which will
be discussed in Section 4. Sometimes we are interested in minimal reducts
or minimal test cost reducts (see, e.g., [S]). In this work, we focus on finding
reducts with minimal test cost, that is, test-cost-sensitive attribute reducts.
Since TCI-NEDS is a generalization of NEDS, concepts in the latter model are
also applicable to the former one. We propose the following concept.

Definition 13. Let $Red(S)$ denote the set of all reducts of a TCI-NEDS
$S = (U, C, D, V, I, n, c)$. Any $R \in Red(S)$ where $c(R) = \min\{c(R') \mid R' \in Red(S)\}$ is
called a minimal test cost reduct.

The problem of finding such a reduct is called the minimal test cost reduct
problem. Proposed in [9], it can be redefined in the new model as follows.

Problem 14. The minimal test cost reduct problem.
Input: $S = (U, C, D, V, I, n, c)$;
Output: $B \subseteq C$;
Constraint: $POS_B(D) = POS_C(D)$;
Optimization objective: $\min c(B)$.

Compared with the classical minimal reduction problem, there are several
differences. The first is the input, where the test costs and measurement
errors are the external information. The second is the optimization objective,
which is to minimize the test cost, instead of the number of features. We can
adopt the addition-deletion strategy [14] to design our heuristic reduction
algorithm.

3.3. Evaluation measures

Three evaluation measures are adopted to evaluate the performance of the 
proposed algorithm in order to dispel the influence of subjective and objective 
factors. We adopt the three measures proposed in [3] for this purpose. These 
are finding optimal factor (FOF), maximal exceeding factor (MEF), and average 
exceeding factor (AEF). 

Let $K$ be the number of experiments and $k$ be the number of experiments in
which an optimal reduct is found. The finding optimal factor is defined as

$$op = \frac{k}{K}, \qquad (20)$$

which is both a qualitative and a quantitative measure.

Let $R'$ be an optimal reduct and $R$ be the searched reduct. The exceeding
factor, indicating the badness of a reduct, is a quantitative measure defined as

$$ef(R) = \frac{c(R) - c(R')}{c(R')}. \qquad (21)$$

The maximal exceeding factor describes the worst case of an algorithm, defined
as

$$\max_{1 \le i \le K} ef(R_i). \qquad (22)$$

The average exceeding factor is defined as

$$\frac{\sum_{i=1}^{K} ef(R_i)}{K}, \qquad (23)$$

which represents the overall performance of an algorithm.
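
The three measures are simple to compute once the test cost of the reduct found in each experiment and the test cost of an optimal reduct are known. The sketch below assumes the exceeding factor is the relative extra cost $(c(R) - c(R'))/c(R')$; the function names and the toy costs are hypothetical and do not come from the experiments.

```python
def exceeding_factor(cost_found, cost_optimal):
    """Relative extra test cost of a found reduct over an optimal one."""
    return (cost_found - cost_optimal) / cost_optimal

def evaluate(found_costs, optimal_costs):
    """Return FOF, MEF and AEF over K experiments."""
    efs = [exceeding_factor(f, o) for f, o in zip(found_costs, optimal_costs)]
    K = len(efs)
    k = sum(1 for e in efs if e == 0)   # experiments where an optimal reduct was found
    return {"FOF": k / K, "MEF": max(efs), "AEF": sum(efs) / K}

# Four toy experiments: two optimal results, one 20% and one 50% above optimal.
result = evaluate([10, 12, 30, 8], [10, 10, 20, 8])
print(result)
```

A finding optimal factor of 1 implies that both exceeding factors are 0, whereas the converse measures how far the non-optimal results are from the optimum.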



4. Algorithm 

Test-cost-sensitive attribute reduct problem is more complex than the tradi- 
tional reduct problem [4]. Heuristic algorithms are needed to find sub-optimal 
reducts for large datasets. To evaluate the performance of a heuristic algorithm 
in terms of the quality of the solution, we should find an optimal reduct from 
all reducts. Hence, exhaustive algorithms are also needed. 

In this section, we mainly present a heuristic algorithm and a competition 
approach to deal with the new problem. The exhaustive algorithm of [4] is 
adopted to find all reducts of datasets. It is based on backtracking where pruning 
techniques are crucial in reducing computation. 

4.1. The $\delta$-weighted heuristic reduction algorithm

To design a heuristic algorithm, we employ an algorithm framework which
is similar to the one proposed in [3]. The algorithm follows the typical
addition-deletion strategy [14] and is listed in Algorithm 1. It constructs a
super-reduct, and then reduces it to obtain a reduct. The algorithm is essentially
different from the one in [5]. First, the input $S$ is a TCI-NEDS instead of a
test-cost-independent decision system (TCI-DS). Second, test results are
numerical rather than nominal. The key code of this framework is listed in lines 5
and 7, and the attribute significance function is redefined to obtain the respective
algorithm. The efficiency of the $\delta$-weighted heuristic reduction algorithm will
be discussed in Section 5.

As previously mentioned, $|ic_B(x)|$ is useful in evaluating the quality of a
neighborhood block. Now we propose the following concepts.

Definition 15. Let $S = (U, C, D, V, I, n)$ be a NEDS, $B \subseteq C$ and $x \in U$. The
number of inconsistent objects in neighborhood $n_B(x)$ is $|ic_B(x)|$. The total
number of such objects with respect to $U$ is

$$nc_B(S) = \sum_{x \in U} |ic_B(x)|, \qquad (24)$$

and with respect to the positive region is

$$pc_B(S) = \sum_{x \in POS_C(D)} |ic_B(x)|. \qquad (25)$$

Finally, we propose a $\delta$-weighted heuristic information function:

$$f(B, a_i, c(a_i)) = \Phi + \frac{\delta}{c(a_i)}, \qquad (26)$$

where $\Phi = pc_B(S) - pc_{B \cup \{a_i\}}(S)$ is necessary and indispensable, and plays a
dominant role in the heuristic information; $c(a_i)$ is the test cost of $a_i$,
and $\delta \ge 0$ is a user-specified parameter. If $\delta = 0$, test costs are essentially not
considered. If $\delta > 0$, tests with lower cost have bigger significance. Different $\delta$
settings can adjust the significance of test cost.



4.2. The competition approach

The competition approach has been discussed in [3] to obtain better results
at the cost of more run-time. In the new environment, it is still valid because there is
no universally optimal $\delta$. In this approach, reducts compete against each other
with only one winner, that is, a reduct with minimal test cost among those
obtained using $\delta \in L$:

$$c_L = \min_{\delta \in L} c(R_\delta), \qquad (27)$$

where $R_\delta$ is the reduct obtained by Algorithm 1 using the heuristic information,
with $L$ the set of user-specified $\delta$ values.

This approach requires more run-time because the algorithm runs $|L|$ times
with different $\delta$ values. Since the heuristic algorithm is fast, this is acceptable
for relatively small $|L|$. The results will be shown in Section 5.3. This simple
approach can enhance the quality of the result significantly.
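
The competition itself is a minimum over repeated runs of the heuristic. In the following sketch, `run_heuristic` stands in for Algorithm 1 and is supplied by the caller; it, the function name `competition` and the toy reducts and costs are hypothetical and serve only to show the control flow.

```python
def competition(run_heuristic, cost, deltas):
    """Run the heuristic once per delta and keep the reduct with minimal test cost.

    run_heuristic(delta) returns a reduct; cost(reduct) returns its total test cost.
    Returns (best_reduct, best_delta).
    """
    best_reduct, best_delta = None, None
    for delta in deltas:
        reduct = run_heuristic(delta)
        if best_reduct is None or cost(reduct) < cost(best_reduct):
            best_reduct, best_delta = reduct, delta
    return best_reduct, best_delta

# Toy stand-in for Algorithm 1: a fixed reduct for each delta (made-up values).
test_cost = {"a1": 5, "a2": 10, "a3": 1}
fake_runs = {2: ["a1", "a2"], 4: ["a2"], 8: ["a2", "a3"]}

best, delta = competition(lambda d: fake_runs[d],
                          lambda R: sum(test_cost[a] for a in R),
                          [2, 4, 8])
print(best, delta)
```

The run-time grows linearly with the number of candidate values, which matches the remark that the approach is acceptable for a relatively small set.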

Algorithm 1 An addition-deletion test-cost-sensitive reduction algorithm.
Input: $(U, C, D, \{V_a\}, \{I_a\}, n, c)$
Output: A reduct with minimal test cost
Method:
 1: $B = \emptyset$;
    // Addition
 2: $CA = C$;
 3: while ($POS_B(D) \neq POS_C(D)$) do
 4:     for each $a \in CA$ do
 5:         Compute $f(B, a, c)$;
 6:     end for
 7:     Select $a'$ with the maximal $f(B, a', c)$;
 8:     $B = B \cup \{a'\}$; $CA = CA - \{a'\}$;
 9: end while
    // Deletion
10: $CD = B$;
11: while ($CD \neq \emptyset$) do
12:     $CD = CD - \{a'\}$;
13:     if ($POS_{B - \{a'\}}(D) = POS_B(D)$) then
14:         $B = B - \{a'\}$;
15:     end if
16: end while
17: return $B$;
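
The addition-deletion framework of Algorithm 1 can be sketched as follows. This is an illustration under stated assumptions, not the implementation used in the experiments: the heuristic is taken in the form $\Phi + \delta / c(a)$, the deletion phase tries the most expensive attributes first, and the data layout and all function names are hypothetical.

```python
def neighborhood(data, i, B, n):
    """Objects within 2 * n[a] of object i on every attribute a in B."""
    return {j for j, x in enumerate(data)
            if all(abs(x[a] - data[i][a]) <= 2 * n[a] for a in B)}

def positive_region(data, dec, B, n):
    """Objects whose neighborhood on B contains no object with a different decision."""
    return {i for i in range(len(data))
            if all(dec[j] == dec[i] for j in neighborhood(data, i, B, n))}

def pc(data, dec, B, n, pos_c):
    """Total number of inconsistent objects over the positive region of C."""
    return sum(sum(1 for j in neighborhood(data, i, B, n) if dec[j] != dec[i])
               for i in pos_c)

def reduce_attributes(data, dec, C, n, c, delta):
    pos_c = positive_region(data, dec, C, n)
    B, CA = [], list(C)
    # Addition: grow B until it preserves the positive region of C.
    while positive_region(data, dec, B, n) != pos_c:
        def f(a):
            phi = pc(data, dec, B, n, pos_c) - pc(data, dec, B + [a], n, pos_c)
            return phi + delta / c[a]          # assumed form of the heuristic
        best = max(CA, key=f)
        B.append(best)
        CA.remove(best)
    # Deletion: drop redundant attributes, most expensive first (assumption).
    for a in sorted(B, key=lambda a: c[a], reverse=True):
        rest = [b for b in B if b != a]
        if positive_region(data, dec, rest, n) == pos_c:
            B = rest
    return B

# Toy system: only a2 separates the two decision classes.
data = [{"a1": 0.1, "a2": 0.1, "a3": 0.5}, {"a1": 0.1, "a2": 0.9, "a3": 0.5},
        {"a1": 0.9, "a2": 0.1, "a3": 0.5}, {"a1": 0.9, "a2": 0.9, "a3": 0.5}]
dec = [0, 1, 0, 1]
n = {"a1": 0.01, "a2": 0.01, "a3": 0.01}
c = {"a1": 5, "a2": 10, "a3": 1}
print(reduce_attributes(data, dec, ["a1", "a2", "a3"], n, c, delta=2))
```

With a very large weight the cheap but uninformative attributes are added first, and the deletion phase is what removes them again.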



5. Experiments 

5.1. Data generation 

Most datasets from the UCI library [55] have no intrinsic measurement errors 
and test costs. To study the performance of the reduction
algorithm, we generate data for experimentation. In this way, different
parameters can be specified and data satisfying normal distributions can be 
employed. Unlike in simpler models, data should not be randomly generated, 
but meet certain constraints. For example, measurement errors satisfy normal 
distribution and Pauta criterion. For the same data item, tests with narrower 
error ranges should be more expensive. In this section, we will discuss both 
the process and substantial settings of data generation. Constraints mentioned 
above are met in this process. 

Step 1. We choose ten datasets from the UCI Repository of Machine Learning
Databases, as listed in Table 2. Each dataset should contain exactly one
decision attribute, and have no missing value. To make the data easier to
handle, data items are normalized into the range from 0 to 1. Missing
values are directly set to 0.5.

Step 2. We produce the number of additional tests for one particular data
item. We use the uniform distribution generator [3] to generate random
integers in the range [0, k]. That is, we have 1 to (k + 1) measurement methods
to obtain values for each data item; k is set to less than 5 in our experiments.
The number of tests for our experiments is |C'| in Table 2.

Table 2: Database information.

No.   Name     Domain        |U|   |C|   |C'|   D
1     Iris     zoology       150   4     4      class
2     Glass    manufacture   214   9     13     type
3     Wine     agriculture   178   13    21     class
4     Wpbc     clinic        198   33    65     outcome
5     Wdbc     clinic        569   30    58     diagnosis
6     Credit   commerce      690   15    23     class
7     Image    graphics      210   19    30     class
8     Iono     physics       351   34    68     class
9     Liver    clinic        345   6     8      selector
10    Diab     clinic        768   8     12     class



Table 3: Generated error ranges for different databases.

Datasets   Minimal   Maximal   Average
Iris       0.0042    0.0044    0.0043
Glass      0.0005    0.0059    0.0030
Wine       0.0031    0.0053    0.0040
Wpbc       0.0011    0.0056    0.0031
Wdbc       0.0006    0.0040    0.0023
Credit     0.0001    0.0056    0.0022
Image      0.0001    0.0065    0.0026
Iono       0.0045    0.0087    0.0061
Liver      0.0011    0.0065    0.0029
Diab       0.0009    0.0059    0.0031

Step 3. We produce $e(a)$ for each original test according to Equation
(7). The value of $e(a)$ is computed from the values in the databases without any
subjectivity. Three kinds of error ranges of different databases are shown in
Table 3. These error ranges are the minimal, maximal and average error ranges of
all attributes, respectively. The precision of $e(a)$ can be adjusted through the $\Delta$
setting, and we set $\Delta$ to be 0.01 in our experiments.

Step 4. We produce "new" data subject to error ranges. Let $a_1$ be the
original test. According to Proposition 9, we can add a random number in
$[-(i - 1)n(a), (i - 1)n(a)]$ to $a_1(x)$ to produce $a_i(x)$, where $x \in U$. The number is
generated as follows.

Let $x_1$ and $x_2$ be uniformly distributed on (0, 1), then

$$y_1(x_1, x_2) = \sqrt{-2 \ln x_1} \cos(2\pi x_2) \qquad (28)$$

is a random number which has a normal distribution with $\mu = 0$ and
$\sigma^2 = 1$. From Proposition 4 we know that $n(a) = 3\sigma$, and hence $\sigma = \frac{1}{3} n(a)$.
Since we need a random number in $[-n(a), +n(a)]$, we let

$$y(n(a), x_1, x_2) = \frac{1}{3} y_1(x_1, x_2) \cdot n(a). \qquad (29)$$

Finally,

$$\begin{cases} -n(a) & \text{if } y < -n(a); \\ n(a) & \text{if } y > n(a); \\ y(n(a), x_1, x_2) & \text{otherwise} \end{cases} \qquad (30)$$

is a random number which has a normal distribution with $\mu = 0$ and
$\sigma = \frac{1}{3} n(a)$. According to Definition 8, $a_i$ is the new test with error range
$\pm i \cdot n(a)$.
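
The generation procedure of Equations (28) to (30) can be sketched as follows. The function name `error_sample` and the use of a seeded `random.Random` instance are assumptions of this example; the first uniform number is drawn from (0, 1] so that the logarithm is always defined.

```python
import math
import random
import statistics

def error_sample(n_a, rng):
    """Draw one measurement error in [-n_a, +n_a] with sigma = n_a / 3."""
    x1 = 1.0 - rng.random()          # uniform on (0, 1], avoids log(0)
    x2 = rng.random()                # uniform on [0, 1)
    y1 = math.sqrt(-2.0 * math.log(x1)) * math.cos(2.0 * math.pi * x2)  # Eq. (28)
    y = y1 * n_a / 3.0                                                  # Eq. (29)
    return max(-n_a, min(n_a, y))                                       # Eq. (30)

rng = random.Random(0)
n_a = 0.0043                          # an error range of the size listed for Iris
samples = [error_sample(n_a, rng) for _ in range(10000)]
print(min(samples), max(samples))
print(statistics.fmean(samples), statistics.pstdev(samples))
```

Clipping affects only the small fraction of draws beyond three standard deviations, so the sample standard deviation stays close to $n(a)/3$.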



Figure 6: Normal distribution measurement errors with different error ranges. Horizontal axis: generated measurement errors.

The generated NDME with different error ranges are shown in Figure 6. The
generated NDME of different databases are shown in Figure 7.

Step 5. We produce test costs, which are always represented by positive
integers. Let $a_1$ be the original test and $a_l$ be the last test for one particular
data item. $c(a_l)$ is set to a random number in [1, 100] subject to the uniform
distribution. $c(a_i)$, where $1 \le i < l$, is set to $2 \times c(a_{i+1})$. This setting guarantees
that tests with narrower error ranges are more expensive.
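
The cost setting of Step 5 amounts to drawing the cost of the last (least precise) test and doubling backwards. A minimal sketch follows, with the hypothetical function name `generate_costs` and a seeded generator chosen only for reproducibility.

```python
import random

def generate_costs(num_tests, rng):
    """Test costs for the tests a_1, ..., a_l of one data item.

    The last test gets a uniform random integer cost in [1, 100];
    every earlier (more precise) test costs twice as much as its successor.
    """
    costs = [0] * num_tests
    costs[-1] = rng.randint(1, 100)
    for i in range(num_tests - 2, -1, -1):
        costs[i] = 2 * costs[i + 1]
    return costs

costs = generate_costs(3, random.Random(1))
print(costs)
```

For a data item with three tests this produces a sequence of the same shape as a cost vector such as 376, 188, 94.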

A dataset generated by this approach is listed in Table 4. SL stands for
sepal length, SW stands for sepal width, PL stands for petal length, and PW
stands for petal width. The suffixes 1 and 2 after SL and PL indicate different revisions of the
original data. There is only one method to obtain SW and PW.

Table 4: A generated measurement error vector and a generated test cost vector (Iris).

a               SL       SL-1     SW       PL       PL-1     PL-2     PW
Original test   True     False    True     True     False    False    True
e(a)            0.0043   0.0086   0.0041   0.0041   0.0082   0.0123   0.0041
c(a)            28       14       81       376      188      94       91

5.2. Effectiveness of the heuristic algorithm 

Let $\delta = 2, 3, 4, \ldots, 9$. The algorithm runs 800 times with different test cost
settings and different $\delta$ settings on all datasets. Figures 8 and 9 show the results
of finding optimal factors. For different settings of $\delta$, the performance of the
algorithm differs markedly; that is, the test cost plays a key role in this
heuristic algorithm. Data for $\delta = 0$ are not included in the experiment results
because the respective results are incomparable to the others.



Figures 10 and 11 show the results of maximal exceeding factors, which
provide the worst case of the algorithm, and they should be viewed as a statistical
measure. Figures 12 and 13 show the average exceeding factors. These display
the overall performance of the algorithm from a statistical perspective.

From the results we observe that the quality of the results varies for different
datasets. It is related to the dataset itself because the error range and heuristic
information are all computed according to the values of the dataset. The
average exceeding factor is less than 0.3 in most cases. In other words, the results
are acceptable. Although the results are generally acceptable, the performance
of the algorithm should be improved. Section 5.3 will address this issue further.

Figure 8: Finding optimal factor (datasets 1-5).

Figure 9: Finding optimal factor (datasets 6-10).



5.3. Comparison of three approaches 

Figure 10: Maximal exceeding factor (datasets 1-5).
Figure 11: Maximal exceeding factor (datasets 6-10).

Now we compare the performance of the proposed algorithm under the three
approaches mentioned in Section 4. The first approach, called the non-weighting
approach, is implemented by setting δ = 0. It is the only approach that does
not take test costs into account. The second approach, called the best δ
approach, chooses the best δ value as depicted in Figures 8 through 13. The
third approach is the competition approach discussed in Section 4.2. All three
are based on Algorithm 1 and the same databases.
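The competition approach can be sketched as a loop over candidate δ values. This sketch assumes that the competition keeps the cheapest reduct among those obtained for the individual δ values; `run_heuristic` is a hypothetical stand-in for Algorithm 1, and the reducts in the toy example are invented purely for illustration. The test costs are those of Table 4.

```python
def competition(run_heuristic, deltas, test_costs):
    """Run the heuristic once per delta and keep the cheapest reduct.

    run_heuristic: callable mapping a delta value to a list of tests
                   (a stand-in for the paper's Algorithm 1).
    test_costs:    dict mapping each test name to its cost c(a).
    """
    best = None
    for delta in deltas:
        reduct = run_heuristic(delta)
        cost = sum(test_costs[a] for a in reduct)
        if best is None or cost < best[2]:
            best = (delta, reduct, cost)
    return best


# Test costs from Table 4; the reducts per delta below are made up.
costs = {"SL": 28, "SL-1": 14, "SW": 81, "PL": 376,
         "PL-1": 188, "PL-2": 94, "PW": 91}
toy_reducts = {2: ["SL", "PL-1"], 3: ["SL-1", "PL-2"], 4: ["SW", "PW"]}

print(competition(lambda d: toy_reducts[d], [2, 3, 4], costs))
```

The run-time grows linearly with the number of δ values tried, which is why the approach is practical only for a relatively small candidate set.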

Figure 13: Average exceeding factor (datasets 6-10).

Table 5 lists results for all three approaches. From Table 5 we observe the
following results:

(1) The non-weighting approach almost never finds the optimal reduct.
Therefore, ignoring test costs is not suitable for the minimal
test cost reduct problem.

(2) In most cases, the best δ approach obtains good results. However, we
do not know how to obtain the best value of δ in real applications.

(3) The competition approach significantly improves the quality of results
at the expense of more run-time, which is acceptable for a relatively small
number of δ values.
Table 5: Results for δ = 0, δ with the optimal setting, and δ with a number of choices.
(FOF: finding optimal factor; MEF: maximal exceeding factor; AEF: average exceeding factor.)

Dataset   FOF                       MEF                       AEF
          δ = 0   δ = δ*  δ ∈ L     δ = 0   δ = δ*  δ ∈ L     δ = 0   δ = δ*  δ ∈ L
Iris      0.170   0.940   0.940     2.000   0.100   0.100     0.360   0.003   0.003
Glass     0.090   0.570   0.640     3.220   0.374   0.374     0.700   0.064   0.049
Wine      0.000   0.900   0.940     19.44   0.423   0.423     4.464   0.021   0.014
Wpbc      0.000   0.840   0.880     45.67   0.300   0.250     14.50   0.033   0.017
Wdbc      0.000   0.710   0.760     93.20   0.500   0.500     14.61   0.041   0.037
Credit    0.000   0.520   0.550     2.188   0.317   0.310     1.095   0.053   0.049
Image     0.000   0.680   0.790     31.43   0.406   0.269     5.417   0.053   0.032
Iono      0.000   0.500   0.630     46.60   0.765   0.544     10.28   0.084   0.054
Liver     0.040   0.780   0.910     4.125   0.275   0.181     0.921   0.023   0.008
Diab      0.000   0.640   0.700     3.788   0.481   0.481     1.278   0.048   0.033

5.4. Comparison with existing algorithm

This section describes the major improvements over an existing model [1].

First, NDME is incorporated into the data model, and a covering-based rough
set based on NDME is proposed. In most cases, measurement errors follow a
normal distribution rather than a uniform distribution; hence, the new
model has wider application areas.

Second, in contrast to the fixed error ranges used for different databases in [1], the
proposed error ranges are generated adaptively according to the database values.
Table 3 shows the generated error ranges for different databases. The error
ranges for different attributes of the same database differ considerably.
For example, the maximal error range of Wdbc is 0.0040, and the minimal one
is 0.0006.

Third, a δ-weighted heuristic algorithm is developed to deal with the minimal
test cost reduct problem. Our algorithm is compared with the existing λ-weighted
algorithm in terms of effectiveness and efficiency. Since the two algorithms have
different parameters, we compare the results of the competition approach on
ten datasets. Figure 14 shows the competition approach results of the two algorithms.
From the results we observe that

(1) On the Wpbc and Iono datasets, the two algorithms have the same performance.

(2) The λ-weighted algorithm performs better than our algorithm on the
Iris, Glass and Credit datasets.

(3) However, our algorithm performs better than the λ-weighted algorithm
on the remaining five datasets.

The efficiency comparison between the δ-weighted algorithm and the λ-weighted
one is depicted in Figure 15. From the results we note that our algorithm
improves on run-time. Figure 16 shows the efficiency ratios of
the δ-weighted algorithm and the λ-weighted algorithm.
Figure 14: Competition approach results of two algorithms.
Figure 15: Efficiency comparison.
Figure 16: Improving efficiency ratio.



6. Conclusion

In rough set models, measurement errors and test costs are both intrinsic to
data. In this paper, we built a new covering-based rough set model considering
measurement errors and test costs at four levels:

1. At the data model level, a new data model with NDME and test cost was
proposed. This model has more application areas because normally
distributed measurement errors are widespread in applications.

2. At the computational model level, we introduced a covering-based rough
set with NDME. This model is generally more complex than those
previously presented in this field.

3. At the problem level, the minimal test cost reduct problem was redefined
on the new model.

4. At the algorithm level, a δ-weighted heuristic algorithm was developed
to deal with this reduct problem. Experimental results indicate the
effectiveness and efficiency of the algorithm.

In summary, the data model based on normal distribution measurement
errors has a wide application scope. This study suggests new research directions
for covering-based rough sets and cost-sensitive learning.